Software Co-design for the 
First Wafer-Scale Processor (and Beyond) 


Cerebras Systems 


Cerebras Wafer 
Scale Engine (WSE) 


The Most Powerful Processor for AI 


400,000 AI-optimized cores 
46,225 mm² silicon 

1.2 trillion transistors 

18 Gigabytes of On-chip Memory 
9 PByte/s memory bandwidth 
100 Pbit/s fabric bandwidth 
TSMC 16nm process 
Architecture Designed for Deep Learning 


Each component optimized for AI compute 


Compute 
* Fully-programmable core, ML-optimized extensions 
* Dataflow architecture for sparse, dynamic workloads 


Fabric Switch 


Memory 
* Distributed, high performance, on-chip memory 


Communication 

* High bandwidth, low latency fabric 

* Cluster-scale networking on chip 

* Fully-configurable to user-specified topology 


Together, orders of magnitude performance and 
efficiency gain 


Linear cluster-scale performance on a single chip 
Programming the Wafer-Scale Engine 
[Figure: compilation flow from Framework to Executable, with Extract and Match stages]


Users program the WSE using standard ML frameworks, e.g. TensorFlow, PyTorch 


Cerebras Graph Compiler automatically compiles the DNN graph 
* Extracts from Framework, converts to Cerebras IR, performs matching to Cerebras kernels 


* Place & Route allocates compute and memory, configures on-chip network 


Enables straightforward programming, flexible execution, high performance 


Cerebras 


Matching to Kernel Library 


Graph matching from FW ops to Kernels: 


* Primitives to be sized and placed by rest of CGC 


* Expressed as nested for-loops for generality 


2 Kernel Types: 
1. Auto-generated 


* General and supports various operations 
* Polyhedral techniques 


* Unrolling loop dimensions across fabric 


2. Hand-optimized 
* High-performance common kernels 


* Hand-tuned ucode and fabric programming 


MATMUL Op 


for i = 0...783: 
    for j = 0...255: 
        out[j] += lhs[i]*rhs[i][j] 


MATMUL Kernel 
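
The nested-loop form of the MATMUL op can be written out as a small runnable sketch. The function name `matmul_op` and the use of plain Python lists are illustrative assumptions; this is only the loop nest from the slide with the 784 x 256 shape its bounds imply, not Cerebras kernel code.

```python
def matmul_op(lhs, rhs):
    """Vector-matrix product written as the two nested loops from the slide."""
    rows = len(lhs)       # 784 on the slide (i = 0...783)
    cols = len(rhs[0])    # 256 on the slide (j = 0...255)
    out = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            out[j] += lhs[i] * rhs[i][j]
    return out


# Illustrative inputs: a vector of ones and a matrix whose column j holds the value j.
lhs = [1.0] * 784
rhs = [[float(j) for j in range(256)] for _ in range(784)]
out = matmul_op(lhs, rhs)
```

Each of the two loop dimensions is a candidate for unrolling across the fabric, which is why the loop-nest form is convenient for sizing and placement.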


Choosing the Optimal Mapping Strategy 


Neural Network Kernels 


* Choose mapping strategy for each kernel 
* Model parallel — size and allocation of each kernel 
* Data parallel — replication factor 


* Strategy determines 
* Allocation of compute cores to kernel 
* Amount of memory to kernel 
* Optimal communication pattern 
Automatically Exploring the Optimization Search Space 


[Figure: neural network kernels alongside two candidate allocations. Each allocation assigns a different share (0% to 100%) of compute, memory, and fabric to each kernel and yields a different network performance.]




Automatically Exploring the Optimization Search Space 
Option    3x3          3x6          6x6          6x12         12x12
Speed     4X slower    2X slower    36 cores     2X faster    4X faster
Area      1/4 area     1/2 area     (baseline)   2X area      4X area
[Figure: compute kernels placed on the fabric, with continuously streaming input and continuously streaming output]
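
A minimal sketch of such a search, under stated assumptions: each kernel picks one of the five sizes shown on the slide, a kernel's time is modeled as work divided by core count (the linear scaling the slide's numbers suggest), and the pipeline runs at the speed of its slowest kernel. The function `best_allocation`, the `work` values, and the core budget are hypothetical; the actual compiler search is not described in the slides.

```python
from itertools import product

# Kernel sizes from the slide: 3x3, 3x6, 6x6, 6x12, 12x12 cores.
OPTIONS = [(3, 3), (3, 6), (6, 6), (6, 12), (12, 12)]


def cores(option):
    return option[0] * option[1]


def best_allocation(work, budget):
    """Exhaustively pick one size per kernel within a total core budget.

    work[k] is a hypothetical amount of work for kernel k. The result
    minimizes the slowest kernel's time, breaking ties by fewer cores.
    """
    best = None
    for choice in product(OPTIONS, repeat=len(work)):
        used = sum(cores(o) for o in choice)
        if used > budget:
            continue
        bottleneck = max(w / cores(o) for w, o in zip(work, choice))
        key = (bottleneck, used)
        if best is None or key < best[0]:
            best = (key, choice)
    return best[1] if best else None


# Three kernels, the middle one with four times the work of its neighbors.
choice = best_allocation(work=[100.0, 400.0, 100.0], budget=216)
```

In this toy case the heavier kernel receives the 12x12 option and the lighter ones 6x6, so all three stages take the same time. Exhaustive search is used only because the example is tiny.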




Co-Designed for Training Flexibility and Performance 


Compiler stack and hardware architecture co-designed 


Result: Flexibility and Performance 
1. Model parallel and data parallel execution 
2. Sparsity harvesting and inducing 


3. Dynamic neural network execution 




1) Flexible Parallelism 


* Optimization search enables spectrum of parallel execution strategies on WSE 
* Single algorithm uses both model and data parallelism in optimization 


* Execution strategies optimized for different network characteristics 


[Figure: Data Parallel (running multiple samples at the same time, on Device 1 and Device 2) and Model Parallel (running multiple parts of the network at the same time, on Device 1 and Device 2). Chart labels: WSE Layer-Pipelined, WSE Layer-sequential, GPU; axis: Data Parallel (Batch Size).]


Using Model Parallelism 


* Run all layers on fabric sections 
* Layers in parallel = more performance 
* Execute network as pipeline 
* Enabled by high bandwidth interconnect 


* Small batch size 
* No need to replicate network 
* No weight sync overhead 
* Weights updated in place 

[Chart: 4-Layer BERT Performance, Fixed Batch Size - 16]


Result: linear performance scaling with 
small batch size 
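
A toy schedule illustrates why running the network as a pipeline keeps every layer busy. It assumes each layer takes one time step per sample and ignores the backward pass; `pipeline_schedule` is an illustrative name, not part of any Cerebras tool.

```python
def pipeline_schedule(num_layers, num_samples):
    """Return, for each time step, the (layer, sample) pairs that are active."""
    steps = []
    for t in range(num_samples + num_layers - 1):
        active = [(layer, t - layer)
                  for layer in range(num_layers)
                  if 0 <= t - layer < num_samples]
        steps.append(active)
    return steps


# 4 layers and 16 samples, matching the 4-layer, batch-16 setting on the slide.
steps = pipeline_schedule(num_layers=4, num_samples=16)
pipelined_steps = len(steps)        # fill + steady state + drain
sequential_steps = 4 * 16           # only one layer busy at any time
```

Once the pipeline is full, all four layers work on different samples in the same step, so throughput approaches one sample per step without replicating the network.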




Using Data Parallelism 


* Run layer replicas on fabric sections 
* Replicas in parallel = more performance 
* Applies to smaller layers/networks 


[Chart: BERT Attention Kernel Performance versus cores, for 1, 2, 4, and 8 data parallel replicas]


* Not forced to large batch size 
* Small batch size per replica 
* Single sample execution enabled by memory performance at low batch 
* Larger batch by running multi-samples 


* Low weight sync overhead 
* Enabled by low latency and high bandwidth interconnect 


Result: linear performance scaling with 
medium batch size 
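
The weight sync step in data parallel execution can be sketched with a scalar model. This assumes a mean-squared-error loss and a batch that divides evenly across replicas; the function names are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, replicas):
    """Each replica computes a gradient on its shard; the sync step averages them."""
    shard = len(xs) // replicas   # assumes the batch divides evenly
    grads = [grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
             for r in range(replicas)]
    return sum(grads) / replicas


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = grad_mse(0.5, xs, ys)                          # one replica, batch of 8
split = data_parallel_grad(0.5, xs, ys, replicas=4)   # four replicas, batch of 2 each
```

With equal shards the averaged gradient matches the full-batch gradient, so adding replicas changes throughput rather than the update. The cost is the averaging step, which is what the interconnect has to keep cheap.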
2) Translating Sparsity into Training Performance 


Large number of zeros in neural network 


[Figure labels: Non-linear, Sparse Network]

* Nonlinears create activation sparsity 
* Harvest natural sparsity in neural network 
* Induce sparsity when not naturally occurring 


Kernels designed for sparsity 




Core Designed for Sparsity 

[Figure: core diagram showing Fabric Input, Data and Ctrl paths, and Fabric Output]


Enabled by dataflow scheduling in hardware 
* Fabric data triggers instruction lookup 
* State machine schedules datapath cycles 


Intrinsic sparsity harvesting 
* Sender filters out sparse zero data 
* Receiver skips unnecessary processing 


Fine-grained execution datapaths 
* Small cores with independent instructions 
* High utilization for dynamic non-uniform work 
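
The sender/receiver behavior can be mimicked in software as a sketch. The hardware mechanism is only summarized on the slide, so the packet format (index, value) and the function names below are assumptions made for illustration.

```python
def send_nonzeros(activations):
    """Sender side: emit (index, value) pairs only for nonzero activations."""
    return [(i, a) for i, a in enumerate(activations) if a != 0.0]


def receive_and_accumulate(packets, weights, out_size):
    """Receiver side: work happens only when a packet arrives."""
    out = [0.0] * out_size
    macs = 0
    for i, a in packets:
        for j in range(out_size):
            out[j] += a * weights[i][j]
            macs += 1
    return out, macs


acts = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0]          # 4 of 6 activations are zero
weights = [[float(i + j) for j in range(3)] for i in range(6)]
out, macs = receive_and_accumulate(send_nonzeros(acts), weights, 3)
dense_macs = len(acts) * 3                     # work a dense kernel would do
```

Here the result is identical to the dense product while the multiply-accumulate count drops from 18 to 6, in proportion to the fraction of nonzero activations.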




Natural Sparsity in Transformer 


* Transformer uses ReLU and Dropout non-linears 
* ReLU is 90% naturally sparse 
* Dropout is 30% naturally sparse 


* 1.2x perf gain vs. dense non-linear and no dropout 

[Figure: Transformer encoder-decoder diagram (input and output embeddings, positional encoding, multi-head attention, add & norm, feed forward, output probabilities), with the feed-forward block marked as sparse]




Inducing Sparsity 


* Sparsity can be induced by: 
* Adding sparse non-linear (e.g. ReLU) 
* Dropping relatively small values 


* Inducing sparsity on 34-Layer dense FC model 
* 1.7x perf gain with ReLU 
* 2.4x perf gain with ReLU+50% sparsity 

[Chart: 34-Layer FC Performance vs. Induced Sparsity. Configurations: Dense, Identity; Dense, ReLU; 25% sparse, ReLU; 50% sparse, ReLU. Series: GPU and CS-1.]
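
Both ways of inducing sparsity can be sketched on a small vector. The slides do not say how "relatively small" is decided, so the magnitude-ranking rule in `drop_smallest` is an assumption, and the input values are made up.

```python
def relu(values):
    return [v if v > 0.0 else 0.0 for v in values]


def drop_smallest(values, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(len(values) * sparsity)     # number of entries to zero
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else v for i, v in enumerate(values)]


def zero_fraction(values):
    return sum(1 for v in values if v == 0.0) / len(values)


x = [0.9, -0.1, 0.05, 1.5, -2.0, 0.3, 0.7, -0.4]
after_relu = relu(x)                          # negatives become zero
after_drop = drop_smallest(after_relu, 0.5)   # ReLU followed by 50% induced sparsity
```

Entries already zeroed by ReLU count toward the target, so in this example only one additional small value (0.05) is dropped to reach 50%.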




Inducing Sparsity in BERT 


* BERT has no natural sparsity 


* But sparsity can be induced on most layers in both fwd and bwd pass 

* Up to 1.5x perf gain with 50% sparsity and minimal accuracy loss 
* ML user has control 

[Chart: BERT Performance vs. Induced Sparsity, with induced sparsity levels including 25% and 50%]




3) Designed to Unlock Smarter Techniques and Scale 


WSE has a dataflow architecture 


* Flexibility to stream token by token 
* Inherent sparsity harvesting 


WSE is a MIMD architecture 


* Can program each core independently 
* Perform different operations on different data 
Flexibility Enables Dynamic ML Methods 


Fine-grained dynamic execution 
enables new ML techniques 


1. Variable sequence length 
* Stop at end of sequence, no padding 


2. Irregular/NAS models 


* High utilization for non-square matrices 


3. Recursive dynamic depth 
* Run enough layers to meet objective 


4. Dynamic (and long) sequence length 
* Process only relevant part of sequence 
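
The saving from item 1 (stopping at the end of each sequence rather than padding) is simple arithmetic, sketched here with made-up sequence lengths and illustrative function names.

```python
def padded_tokens(lengths):
    """Tokens processed when every sequence is padded to the longest one."""
    return max(lengths) * len(lengths)


def streamed_tokens(lengths):
    """Tokens processed when each sequence stops at its own end."""
    return sum(lengths)


lengths = [12, 128, 37, 64]          # hypothetical batch of four sequences
padded = padded_tokens(lengths)
streamed = streamed_tokens(lengths)
saved = 1 - streamed / padded        # fraction of padded work that is avoided
```

For this batch, padding to the longest sequence processes 512 token positions against 241 real tokens, so a little over half the padded work is avoided.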




Universal Transformer with Dynamic Depth 


Parameters are tied across positions and time steps 

[Figure: repeated Self-Attention and Transition Function blocks; axes: Positions and Depth]
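
Dynamic depth with tied parameters amounts to applying the same block repeatedly until an objective is met. The sketch below uses a scalar stand-in for the self-attention and transition block and a simple threshold as the stopping rule; both are placeholders, not the Universal Transformer's actual halting mechanism.

```python
def run_dynamic_depth(state, step, done, max_depth):
    """Apply the same `step` function until `done(state)` holds or max_depth is hit."""
    depth = 0
    while depth < max_depth and not done(state):
        state = step(state)
        depth += 1
    return state, depth


def step(x):
    # Stand-in for one tied self-attention + transition block.
    return 0.5 * x


def done(x):
    # Stand-in objective: stop once the state is small enough.
    return abs(x) < 0.1


easy_state, easy_depth = run_dynamic_depth(1.0, step, done, max_depth=16)
hard_state, hard_depth = run_dynamic_depth(8.0, step, done, max_depth=16)
```

The two inputs stop at different depths (4 and 7 steps), which is the kind of input-dependent work that a fixed-depth schedule cannot exploit.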


CS-1 HW/SW Co-design Enables Next Gen DL models 


The community wants smarter and larger models 


* CS-1 is the most powerful single node 
* Automatic scaling through model and data parallelism 
* Accessible cluster-scale performance on a single chip 


* CS-1 is flexible and dynamic 
* Fine-grained sparsity harvesting and induction 
* Novel adaptive & dynamic ML techniques 


This combination of flexibility and performance enables the next generation of models 
and techniques otherwise challenged today. 




Wafer Scale Engine — Generation 2 


850,000 AI-optimized cores 
2.6 Trillion Transistors 


TSMC 7nm Process 


